-
Notifications
You must be signed in to change notification settings - Fork 556
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster HfFileSystem.find
#1809
Faster HfFileSystem.find
#1809
Conversation
The documentation is not available anymore as the PR was closed or merged. |
A
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mariosasko for the PR and all the tests. I reviewed the logic and it looks good to me but I'd prefer a second pair of eyes on this (@lhoestq?). I only spotted 1 potential issue when detecting the common path of a list of dirs.
Otherwise, I like the list_repo_tree
and get_paths_info
that are more consistent with the server-side API. I was wondering if we could get rid of list_paths_info
that is a mix of both, without any optimization. I saw it is used in datasets
but list_repo_tree
can be a drop-in replacement for it. In any case let's not do it in this PR that is already quite big.
@@ -226,18 +226,21 @@ def repo_type_and_id_from_hf_id(hf_id: str, hub_url: Optional[str] = None) -> Tu | |||
return repo_type, namespace, repo_id | |||
|
|||
|
|||
class LastCommitInfo(TypedDict, total=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it is not your fault but I'm not a fan of those TypedDict
... 😕
I think we should open a separate PR to switch all of them (BlobLfsInfo
, LastCommitInfo
, BlobSecurityInfo
, TransformersInfo
, SafeTensorsInfo
) to dataclasses, in a backward compatible way (and with a deprecation warning when dict-only method is used). This way we'll finally have a single type to represent data returned by the server (now that you've removed ReprMixin).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense! Maybe this can be a "good first issue"?
if (recursive and dirs_not_in_dircache) or (detail and dirs_not_expanded): | ||
# If the dircache is incomplete, find the common path of the missing and non-expanded entries | ||
# and extend the output with the result of `_ls_tree(common_path, recursive=True)` | ||
common_path = os.path.commonprefix(dirs_not_in_dircache + dirs_not_expanded).rstrip("/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to Python docs, common_path
is not guaranteed to be a valid path. For example,
>>> import os
>>> os.path.commonprefix(["data/train/", "data/test/"])
"data/t"
which can lead to an unexpected result in self._ls_tree(common_path, ...)
just below.
I saw there is os.path.commonpath
as well but I'm afraid it will not work on Windows. What we can do is to use PurePosixPath
like this:
from pathlib import PurePosixPath
from typing import List
def unix_commonpath(paths: List[str]) -> str:
"""Return the longest common subpath of a list of Unix paths."""
common_parents = set.intersection(
*[set(parent.as_posix() for parent in PurePosixPath(path).parents) for path in paths]
)
return max(common_parents, key=lambda p: len(p))
>>> unix_commonpath(["data/train/", "data/test/"])
"data"
Let me know if you think of a simpler solution. If we go with this one, could you add a few test cases to ensure it works cross-platform?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also feel free to add "/" at the end of the common path, otherwise the subsequent .startswith(common_path)
calls will match directories like "data_with_a_suffix"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a simpler fix to this. I added a test to ensure it works.
"name": path, | ||
"size": 0, | ||
"type": "directory", | ||
"tree_id": None, # TODO: tree_id of the root directory? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. Not sure there's one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice ! 🔥 🔥 🔥
if (recursive and dirs_not_in_dircache) or (detail and dirs_not_expanded): | ||
# If the dircache is incomplete, find the common path of the missing and non-expanded entries | ||
# and extend the output with the result of `_ls_tree(common_path, recursive=True)` | ||
common_path = os.path.commonprefix(dirs_not_in_dircache + dirs_not_expanded).rstrip("/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also feel free to add "/" at the end of the common path, otherwise the subsequent .startswith(common_path)
calls will match directories like "data_with_a_suffix"
|
||
def info(self, path: str, **kwargs) -> Dict[str, Any]: | ||
resolved_path = self.resolve_path(path) | ||
def info(self, path: str, refresh: bool = False, revision: Optional[str] = None, **kwargs) -> Dict[str, Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we want the info with detail=False
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The spec doesn't expose the detail parameter in .info
: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/spec.py#L639.
IMO, not exposing this parameter makes sense, as the .info
method should return all the existing info about a file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would make calls to isdir
pretty slow no ? We use it a lot in datasets
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally isdir
calls info()
with detail=False
and info()
doesn't need to call expand=True
in this case, which is slow on the Hub side.
It's a bit out of scope of this PR though, we can see later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the change to allow calling .info
with detail=False
(through the kwargs
) to support efficient exists
/isdir
/isfile
, but detail
is not epoxsed as the method parameter to follow the spec.
Co-authored-by: Lucain <[email protected]>
|
Ah yes, I meant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM !
I also opened #1815 (continuation of this PR) to apply this optimization to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for making the changes!
Supersedes #1443
Makes the
HfFileSystem.find
method faster by making it "one-shot" instead of callingls
on each subdirectory. It also makesHfFileSystem.info
faster by replacing the default implementation that fetches the parent path's/tree
with a call to the/paths-info
endpoint.To implement this, this PR adds the following methods to the
HfApi
:list_repo_tree
(to optimizeHfFileSystem.find
)get_paths_info
(to optimizeHfFileSystem.get_paths_info
)Additional improvements:
last_commit
,security
, etc.) to the output whendetail=True
to make it more consistent withHfAp
i'sRepoFile
(andRepoFolder
)NotImplementedError
= does not exist as per AbstractFileSystem.exists() catches every exception fsspec/filesystem_spec#1379 (comment)) and updates the test (it's more consistent withglob.glob
now)